The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system. Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class to read one or more text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only the files matching a specific pattern.
textFile(): Reads one or more text or CSV files and returns a single RDD[String], one element per line.
wholeTextFiles(): Reads one or more files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in each tuple is the file name and the second value (_2) is the content of that file.
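The difference between the two return shapes can be sketched without Spark at all, using plain Scala collections as a local stand-in (the directory and file names below are made up for illustration):

```scala
import java.nio.file.Files
import scala.io.Source

// Create a throwaway directory with two small files
val dir = Files.createTempDirectory("demo")
Files.write(dir.resolve("a.txt"), "line1\nline2".getBytes)
Files.write(dir.resolve("b.txt"), "line3".getBytes)

val files = dir.toFile.listFiles.sortBy(_.getName).toList

// textFile-style: one String per LINE, flattened across all files
val lines = files.flatMap(f => Source.fromFile(f).getLines().toList)

// wholeTextFiles-style: one (fileName, fileContent) tuple per FILE
val pairs = files.map(f => (f.getName, Source.fromFile(f).mkString))

println(lines)           // List(line1, line2, line3)
println(pairs.map(_._1)) // List(a.txt, b.txt)
```

So textFile() loses the file boundaries, while wholeTextFiles() preserves them, which is why the latter suits small files you need to process whole.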
Read a text file from HDFS
val rdd = sc.textFile("/FileStore/tables/orders.txt")
rdd.collect.foreach(f=>{println(f)})
Read a text file from the local file system
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
Read all text files from a directory
To read every file in a directory (or only the files matching a pattern), pass a wildcard path to textFile(). If instead you want the entire contents of each file as a single record, use the wholeTextFiles() method on sparkContext, which returns (fileName, fileContent) tuples.
val rdd = sc.textFile("/FileStore/tables/*")
rdd.collect.foreach(f=>{println(f)})
RDD From File
val scalaFile = scala.io.Source.fromFile("/data/retail_db/products/part-00000").getLines.toList
val scalaFileRDD = sc.parallelize(scalaFile)
Word Count Program
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
val words = rdd.flatMap(x => x.split(" ")).map(x => (x,1))
val word_count = words.reduceByKey((x, y) => x + y)
word_count.collect()
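The same flatMap/map/reduceByKey pipeline can be sketched with plain Scala collections, which makes the per-step semantics easy to check locally (the input lines here are made up; groupBy plus a per-key sum plays the role of reduceByKey):

```scala
val lines = List("spark makes word count easy", "word count with spark")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split(" "))    // flatMap: one element per word
    .map(w => (w, 1))         // map: pair each word with a count of 1
    .groupBy(_._1)            // group pairs by word, like reduceByKey's shuffle
    .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum the 1s per word

println(counts("spark")) // 2
println(counts("easy"))  // 1
```

The difference in Spark is that reduceByKey combines values on each partition before shuffling, so only partial sums cross the network.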